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Abstract. We introduce an adaptive output-sensitive Metropolis-Hast¬ 
ings algorithm for probabilistic models expressed as programs, Adap¬ 
tive Lightweight Metropolis-Hastings (AdLMH). The algorithm extends 
Lightweight Metropolis-Hastings (LMH) by adjusting the probabilities 
of proposing random variables for modification to improve convergence 
of the program output. We show that AdLMH converges to the correct 
equilibrium distribution and compare convergence of AdLMH to that of 
LMH on several test problems to highlight different aspects of the adap¬ 
tation scheme. We observe consistent improvement in convergence on the 
test problems. 
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1 Introduction 

One strategy for improving convergence of Markov Chain Monte Carlo (MCMC) 
samplers is through online adaptation of the proposal distribution mm- An 
adaptation scheme must ensure that the sample sequence converges to the cor¬ 
rect equilibrium distribution. In a componentwise updating Metropolis-Hastings 
MCMC, i.e. Metropolis-within-Gibbs mm, the proposal distribution can be 
decomposed into two components: 

1. A stochastic schedule (probability distribution) for selecting the next random 
variable for modification. 

2. The kernels from which new values for each of the variables are proposed. 

In this paper we concentrate on the first component—adapting the schedule for 
selecting a variable for modification. 

Our primary interest in this work is to improve MCMC methods for proba¬ 
bilistic programming 161211111^ ■ Probabilistic programming languages facilitate 
development of general probabilistic models using the expressive power of gen¬ 
eral programming languages. The goal of inference in such programs is to reason 
about the posterior distribution over random variates that are sampled during 
execution, conditioned on observed values that constrain a subset of program 
expressions. 
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Lightweight Metropolis-Hastings (LMH) samplers [15] propose a change to 
a single random variable at each iteration. The program is then rerun, reusing 
previous values and computation where possible, after which the new set of sam¬ 
ple values is accepted or rejected. While re-running the program each time may 
waste some computation, the simplicity of LMH makes developing probabilistic 
variants of arbitrary languages relatively straightforward. 

Designing robust Adaptive MCMC methods for probabilistic programming is 
complicated because of diversity of models expressed by probabilistic programs. 
The same adaption scheme should perform well with different programs without 
manual tuning. Here we present an adaptive variant of LMH, which dynamically 
adjusts the schedule for selecting variables for modification. First, we review 
the general structure of a probabilistic program. We discuss convergence crite¬ 
ria with respect to the program output and propose a scheme for tracking the 
“influence” of each random variable on the output. We then adapt the selection 
probability for each variable, borrowing techniques from the upper confidence 
bound (UCB) family of algorithms for multi-armed bandits [3|. We show that 
the proposed adaptation scheme preserves convergence to the target distribution 
under reasonable assumptions. Finally, we compare original and Adaptive LMH 
on several test problems to show how convergence is improved by adaptation. 

2 Preliminaries 

2.1 Probabilistic Program 

A probabilistic program is a stateful deterministic computation V with the fol¬ 
lowing properties: 

— Initially, V expects no arguments. 

— On every call, V returns either a distribution F, a distribution and a value 
{G,y), a value z, or T. 

— Upon returning F, V expects a value x drawn from F as the argument to 
the next call. 

— Upon returning (G, y) or z, V is invoked again without arguments. 

— Upon returning T, V terminates. 

A program is run by calling V repeatedly until termination. 

Every run of the program implicitly produces a sequence of pairs {Fi,Xi) of 
distributions and values of latent random variables. We call this sequence a trace 
and denote it by x. A trace induces a sequence of pairs {Gj,yj) of distributions 
and values of observed random variables. We call this sequence an image and 
denote it by y. We call a sequence of values an output of the program and 
denote it by z. Program output is deterministic given the trace. 

The probability of a trace is proportional to the product of the probability 
of all random choices x and the likelihood of all observations y 

1^1 |2/I 

p{x) OC Y[pF,{x,)Y[pG,iyj). 

i=l j=l 


(1) 
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Algorithm 1 Adaptive componentwise MH 
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Select initial point . 

Set initial selection probabilities 


for t = 1 ... 00 do 

OL i J (Q! ,X ,X ). 

Choose k £ {1,..., N} with probability a\. 
Generate x' ~ qg{x\x‘'). 

. ( 7T(x')qa(x^\x') 

x^^^ <— x' with probability p, x^ otherwise. 

end for 


The objective of inference in probabilistic program V is to discover the distribu¬ 
tion of 2 :. 


2.2 Adaptive Markov Chain Monte Carlo 


MCMC methods generate a sequence of samples {x‘}“ ^ by simulating a Markov 
chain using a transition operator that leaves a target density 7r(x) invariant. In 
MH the transition operator is implemented by drawing a new sample x' from a 
parameterized proposal distribution qg{x'\x*} that is conditioned on the current 
sample x *. The proposed sample is then accepted with probability 


p = min 


/ TT{x')qg{x^\x') \ 

\7r(x‘)g'e(x'|x‘)’ ) 


( 2 ) 


If x' is rejected, is re-used as the next sample. 

The convergence rate of MH depends on parameters 9 of the proposal distri¬ 
bution qg. The parameters can be set either offline or online. Variants of MCMC 
in which the parameters are continuously adjusted based on the features of the 
sample sequence are called adaptive. Challenges in design and analysis of Adap¬ 
tive MCMC methods include optimization criteria and algorithms for the pa¬ 
rameter adaptation, as well as conditions of convergence of Adaptive MCMC to 
the correct equilibrium distribution [T3]. Continuous adaptation of parameters 
of the proposal distribution is a well-known research subject mm- 

In a componentwise MH algorithm [10] which targets a density tt{x) de¬ 
fined on an A-dimensional space A, the components of a random sample x = 
{xi,..., xjv} are updated individually, in either random or systematic order. 
Assuming the component i is selected at the step t for modification, the pro¬ 
posal x' sampled from qg{x\x*) may differ from x* only in that component, 
and xl = X* for all j ^ i. Adaptive componentwise Metropolis-Hastings (Algo¬ 
rithmic chooses different probabilities for selecting a component for modification 
at each iteration. Parameters of this scheduling distribution may be viewed as a 
subset of parameters 9 of the proposal distribution qg, and adjusted according 
to optimization criteria of the sampling algorithm. 
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Varying selection probabilities based on past samples violates the Markov 
property of However, provided the adaptation of the selection probabili¬ 
ties diminishes, with ||a‘ — 0, then under suitable regularity conditions 

for the target density (see Section an adaptive componentwise MH algorithm 
will still be ergodic [5] , and the distribution on x induced by Algorithm con¬ 
verges to TT. 


2.3 Lightweight Metropolis-Hastings 


LMH |15) is a sampling scheme for probabilistic programs where a single random 
variable drawn in the course of a particular execution of a probabilistic program 
is modified via a standard MH proposal, and this modification is accepted by 
comparing the values of the joint probability of old and new program traces. 
LMH differs from componentwise MH algorithms in that other random variables 
may also have to be modified, depending on the structural dependencies in the 
probabilistic program. 

LMH initializes a proposal by selecting a single variable from an execution 
trace x and resampling its value x'). either using a reversible kernel K{x'f.\xk) 
or from the conditional prior distribution. Starting from this initialization, the 
program is rerun to generate a new trace x'. For each m > k, the previous 
value Xm is reused, provided it still lies in the support of the distribution on x'^, 
rescoring its log probability relative to the new random choices {x [,..., x'^_i}. 
When x'j^ cannot be rescored, a new value is sampled from the prior on a;(„, 
conditioned on preceding choices. The acceptance probability plmh is obtained 
by substituting Q into ([^: 


Plmh = min 


/ p{y'\x')p{x')q{x\x') \ 
V ’ P{y\x)p{x)q{x'\x) ) 


( 3 ) 


We here further simplify LMH by assuming x'^ is sampled from the prior condi¬ 
tioned on earlier choices and that all variables are selected for modihcation with 
equal probability. In this case, plmh takes the form m 


_ . / p{y'\x')p{x')\x\p{x\x'\xr\x') \ 

Plmh mm 


( 4 ) 


where x'\x denotes the resampled variables, and x' Cl x denotes the variables 
which have the same values in both traces. 


3 Adaptive Lightweight Metropolis-Hastings 

We develop an adaptive variant of LMH, which dynamically adjusts the proba¬ 
bilities of selecting variables for modihcation (Algorithm]^. Let x* be the trace 
at iteration t of Adaptive LMH. We dehne the probability distribution of se¬ 
lecting variables for modihcation in terms of a weight vector W* that we adapt. 
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Algorithm 2 Adaptive LMH 
1: Initialize W® to a constant. 

2: Run the program. 

3: for t = 1... 00 do 

4: Randomly select a variable x^. according to 

5: Propose a value for x\. 

6: Run the program, accept or reject the trace. 

7: if accepted then 

8: Compute based on the program output. 

9: else 

10 : ^ 

11: end if 

12: end for 


such that the probability a* of selecting the variable for modification is 


\xt\ 

E wi 


( 5 ) 


Just like LMH, Adaptive LMH runs the probabilistic program once and then 
selects variables for modification randomly. However, acceptance ratio paulmh 
must now include selection probabilities and a'f. of the resampled variable in 
the current and the proposed sample 


PAdLMH = min 


/ p{y' \x')p{x')a'^p{x\x'\x n x') \ 
V ’ p{y\x)p{x)akp{x'\x\x' \^x) J 


( 6 ) 


This high-level description of the algorithm does not detail how W* is com¬ 
puted for each iteration. Indeed, this is the most essential part of the algorithm. 
There are two different aspects here — on one hand, the influence of a given 
choice on the output sequence must be quantified in terms of convergence of 
the sequence to the target distribution. On the other hand, the influence of the 
choice must be translated into re-computation of weights of random variables in 
the trace. Both parts of re-computation of W* are explained below. 


3.1 Quantifying Influence 

Extensive research literature is available on criteria for tuning parameters of 
Adaptive MCMC [mill!. The case of inference in probabilistic programs is 
different: the user of a probabilistic program is interested in fast convergence 
of the program output {z*} rather than of the vector of the program’s random 
variables {x‘}. 

In adaptive MCMC variants the acceptance rate can be efficiently used as 
the optimization objective M- However, for convergence of the output sequence 
an accepted trace that produces the same output is indistinguishable from a 
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rejected trace. Additionally, while optimal values of the acceptance rate mm 
can be used to tune parameters in adaptive MCMC, in Adaptive LMH we do 
not change the parameters of proposal distributions of individual variables, and 
assume that they are fixed. However, proposing a new value for a random variable 
may or may not change the output even if the new trace is accepted. By changing 
variable selection probabilities we attempt to maximize the change in the output 
sequence so that it converges faster. In the pedagogical example 

xi ~Bernoulli(0.5), X 2 ~ A/'(xi, 1), 

Zi <—{Xi,X2), 


selecting the Bernoulli random choice for modification changes the output only 
when a different value is sampled, while selecting the normal random choice will 
change the output almost always. 

Based on these considerations, we quantify the influence of sampling on the 
output sequence by measuring the change in the output 2 : of the probabilistic 
program. Since probabilistic programs may produce output of any type, we chose 
to discern between identical and different outputs only, rather than to quantify 
the distance by introducing a type-dependent norm. In addition, when \z\ > I, 
we quantify the difference by the fraction of components of z with changed 
values. 

Formally, let {z‘}“ = {z^,... ...} be the output sequence of a 

probabilistic program. Then the influence of a choice that produced z‘ is defined 
by the total reward i?*, computed as normalized Hamming distance between the 
outputs 


R* = 



|A| 




(7) 


In the case of a scalar z, the reward is 1 when outputs of subsequent samples 
are different, 0 otherwise. 

The reward is used to adjust the variable selection probabilities for the sub¬ 
sequent steps of Adaptive LMH by computing (line of Algorithm]^. 

It may seem sufficient to assign the reward to the last choice and use average 
choice rewards as their weights, but this approach will not work for Adaptive 
LMH. Consider the generative model 


xi 10), X2 ~ A/’(xi, 1), 

yi r^M{x 2 , 1 ), 

Zi <—Xi, 


where we observe the value yi = 2. Modifying X 2 may result in an accepted 
trace, but the value of Zi = xi, predicted by the program, will remain the same 
as in the previous trace. Only when Xi is also modified, and a new trace with 
the updated values for both xi and X 2 is accepted, the earlier change in X 2 is 
indirectly reflected in the output of the program. In the next section, we discuss 
propagation of rewards to variable selection probabilities in detail. 
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3.2 Propagating Rewards to Variables 

Both LMH and Adaptive LMH modify a single variable per trace, and either 
re-use or recompute the probabilities of values of all other variables (except 
those absent from the previous trace or having an incompatible distribution, for 
which new values are also sampled). Due to this updating scheme, the influence 
of modifying a variable on the output can be delayed by several iterations. We 
propose the following propagation scheme: for each random variable a;*, the re¬ 
ward Ti and count Ci are kept in a data structure used to compute W. The set 
of variables selected for modification, called here the history, is maintained for 
each component Zk of output z. When the value of Zk changes, the reward is 
distributed between all of the variables in the history, and the history is emp¬ 
tied. When Zfc does not change, the selected variable is penalized by zero reward. 
This scheme, for the case of scalar output for simplicity, is shown in Algorithm]^ 
which expands linej^of Algorithmj^ When z has multiple components, histories 
for each component are maintained independently. 


Algorithm 3 Propagating Rewards to Variables 
1: Append to the history of variables selected for modification. 

2: if 7 ^ z* then 

^ ^ \history\ 

4: for Xm in history do 

5: fm^Tyn+W, Cm Cm + W 

6: end for 

7: Flush the history. 

8: else 

9: Ck-<^ Ck + 1 

10: end if 


Rewarding all of the variables in the history ensures that while variables 
which cause changes in the output more often get a greater reward, variables 
with lower influence on the output are still selected for modification sufficiently 
often. This, in turn, ensures ergodicity of sampling sequence, and helps establish 
conditions for convergence to the target distribution, as we discuss in Section]^ 
Let us show that under certain assumptions the proposed reward propaga¬ 
tion scheme has a non-degenerate equilibrium for variable selection probabilities. 
Indeed, assume that for a program with two variables, Xi, and X 2 , probability 
matching, or selecting a choice with the probability proportional to the unit 
reward pi = is used to compute the weights, that is, Wi = pi. Then, the 
following lemma holds: 

Lemma 1 Assume that for variables Xi, where i S {1,2}; 

— at is the selection probability; 

— f3i is the probability that the new trace is accepted given that the variable was 
selected for modification; 
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— 7 i is the probability that the output changed given that the trace was accepted. 

Assume further that oti, jSi, and are constant. Then for the case 71 = 1, 
72 = O; 


0 < 


^ 1 
ai ~ 3 


( 8 ) 


Proof. We shall proof the lemma in three steps. First, we shall analyse a sequence 
of samples between two subsequent arrivals of Xi. Then, we shall derive a formula 
for the expected unit reward of X 2 . Finally, we shall bound the ratio 

Consider a sequence of k samples, for some k, between two subsequent arrivals 
of xi, including the sample corresponding to the second arrival of xi. Since a 
new value of xi always (71 = 1 ) and X 2 never (72 = 0 ) causes a change in the 
output, at the end of the sequence the history will contain k occurrences of X 2 . 
Let us denote by Ar^, Aci the increase of reward and count Ci between the 
beginning and the end of the sequence. Noting that X 2 is penalized each time 
it is added to the history (line of Algorithm 3), and k occurrences of X 2 are 
rewarded when xi is added to the history (line 5 of Algorithm]^, we obtain 


Ari = 


1 


, Aci = 


k + 1 


Ary, = 


k + 1 


Ac 2 = k 


k + 1 


(9) 


Consider now a sequence of M such sequences. When M —>• 00 , approaches 
the expected unit reward Pi, where and c^m are the reward and the count 
of Xi at the end of the sequence. 


_ T ^ iM 

= lim - 

M->-oo CiM 


= lim 


ViM 

M 


= lim 


M 


M—>-00 M—>-00 


M 


/\r ■ 

<“> 


Each variable Xi is selected randomly and independently and produces an 
accepted trace with probability 


Pi = 


^i(di 


O-lPl +02/32' 


( 11 ) 


Acceptances of Xi form a Poisson process with rate ^ ^ is dis¬ 

tributed according to the geometric distribution with probability pi, Pr[fc] = 
(1 — Pi)^Pi. Since Ari = Aci for any fc, the expected unit reward Pi of Xi is 1. 
We shall substitute Ari and Aci into (10) to obtain the expected unit reward 
P 2 of X 2 . 


_ h. 

= £ (* + til) (1 - n)+ = £7^ + £ £/t(i - J-i+i (12) 


/c =0 


k 
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P2 = 


Ar2 

Aco 




/c =0 


i-E 


k^O 


k I 


(I-PI)V 


k—0 




fc =0 


(13) 


For probability matching, selection probabilities are proportional to expected 
unit rewards: _ 

0^2 _ P2 

ai Pi 

To prove the inequality, we shall derive a closed-form representation for p 2 i 


(14) 


analyse solutions of (14) for We shall eliminate the summation A in (13): 




Pi 




k 1 


(i-Pi)'=+' 


1-Pi 4^0 


By substituting A into (13), and then pi and p 2 into (14), we obtain 


_ 1 I Pi log Pi 

^ P2 ^ ^ ^ I-Pl 

Q-l “ p - P2- j_ Pi log Pi 

^ Pi i-pi 


B{pi 


(15) 


(16) 


The right-hand side B{pi) of (161 is a monotonic function for pi € [0,1], and 


-B(O) = 0, B{1) — ^ (see Appendix for the analysis of B{pi)). According to (11), 

3’ 


062 _ 


061 


= 0 implies pi = 1, hence ^ ^ 0, and 0 < (^ < 


ai 


□ 


By noting that any subset of variables in a probabilistic program can be 
considered a single random variable drawn from a multi-dimensional distribution. 
Lemmais generalized to any number of variables by Corollary]^ 

Corollary 1 For any partitioning of the set x of random variables of a prob¬ 
abilistic program, AdLMH with weights proportional to expected unit rewards 
selects variables from each of the partitions with non-zero probability. 

To ensure convergence of W* to expected unit rewards in the stationary 
distribution, we use upper confidence bounds on unit rewards to compute the 
variable selection probabilities, an idea which we borrowed from the UCB family 
of algorithms for multi-armed bandits [3]. Following UCBl j3], we compute the 
upper confidence bound pi as the sum of the unit reward and the exploration 
term 

EgESE 


Pi — Pi ~\~ C \ 


(17) 
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where C is an exploration factor. The default value for C is y/2 in UCBl; in 
practice, a lower value of C is preferable. Note that variable selection in Adaptive 
LMH is different from arm selection in multi-armed bandits: unlike in bandits, 
where we want to sample the best arm at an increasing rate, in Adaptive LMH 
we expect W* to converge to an equilibrium in which selection probabilities are 
proportional to expected unit rewards. 


4 Convergence of Adaptive LMH 


As adaptive MCMC algorithms may depend arbitrarily on the history at each 
step, showing that a given sampler correctly draws from the target distribution 
can be non-trivial. General conditions under which adaptive MCMC schemes 
are still ergodic, in the sense that the distribution of samples converges to the 
target tt in total variation, are established in m- The fundamental criteria for 
validity of an adaptive algorithm are diminishing adaptation, which (informally) 
requires that the amount which the transition operator changes each iteration 
must asymptotically decrease to zero; and containment, a technical condition 
which requires that the time until convergence to the target distribution must 
be bounded in probability [3]. 

The class of models representable by probabilistic programs is very broad, 
allowing specification of completely arbitrary target densities; however, for many 
models the adaptive LMH algorithm reduces to an adaptive random scan Metro¬ 
polis-within-Gibbs in Algorithmic To discuss when this is the case, we invoke the 
concept of structural versus structure-preserving random choices na. Crucially, 
a structure-preserving random choice Xk does not affect the existence of other 
Xm in the trace. 

Suppose we were to restrict the expressiveness of our language to admit only 
programs with no structural random choices: in such a language, the LMH al¬ 
gorithm in Algorithm [C reduces to the adaptive componentwise MH algorithm. 
Conditions under which such an adaptive algorithm is ergodic have been estab¬ 
lished explicitly in [51 Theorems 4.10 and 5.5]. Given suitable assumptions on the 
target density defined by the program, it is necessary for the probability vector 
jja* — —>■ 0, and that for any particular component k we have probability 

> e > 0. Both of these are satisfied by our approach: from Corollary we 
ensure that the unit reward across each Xi converges to a positive fixed point. 

While any theoretical result will require language restrictions such that pro¬ 
grams only induce distributions satisfying regularity conditions, we conjecture 
that this scheme is broadly applicable across most non-pathological programs. 
We leave a precise theoretical analysis of the space of probabilistic programs in 
which adaptive MCMC schemes (with inhnite adaptation) may be ergodic to 
future work. Empirical evaluation presented in the next section demonstrates 
practical convergence of Adaptive LMH on a range of inference examples, in¬ 
cluding programs containing structural random choices. 
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5 Empirical Evaluation 

We evaluated Adaptive LMH on many probabilistic programs and observed con¬ 
sistent improvement of convergence rate compared to LMH. We also verified 
on a number of tests that the algorithm converges to the correct distribution 
obtained by independent exact methods. In this section, we compare Adaptive 
LMH to LMH on several representative examples of probabilistic programs. The 
rates in the comparisons are presented with respect to the number of samples, or 
simulations, of the probabilistic programs. The additional computation required 
for adaptation takes negligible time, and the computational effort per sample is 
approximately the same for all algorithms. Our implementation of the inference 
engine is available at https://bitbucket.org/dtolpin/embang 

In the following case studies differences between program outputs and target 
distributions are presented using KullBack-Leibler (KL) divergence, Kolmogorov- 
Smirnov (KS) distance, or L2 distance, as appropriate. In cases where target 
distributions cannot be updated exactly, they were approximated by running a 
non-adaptive inference algorithm for a long enough time and with a sufficient 
number of restarts. In each of the evaluations, all of the algorithms were run 
with 25 random restarts and 500 000 simulations of the probabilistic program 
per restart. The difference plots use the logarithmic scale for both axes. In the 
plots, the solid lines correspond to the median, and the dashed lines to 25% and 
75% percentiles, taken over all runs of the corresponding inference algorithm. 
The exploration factor for computing upper confidence bounds on unit rewards 
(Equation]^ was fixed at C = 0.5 for all tests and evaluations. 

The first example is a latent state inference problem in an HMM with three 
states, one-dimensional normal observations (0.9, 0.8, 0.7, 0, -0.025, 5, 2, 0.1, 
0, 0.13, 0.45, 6, 0.2, 0.3, -1, -1) with variance 1.0, a known transition matrix, 
and known initial state distribution. There are 18 distinct random choices in all 
traces of the program, and the 0th and the 17th state of the model are predicted. 
The results of evaluation are shown in Figure]^ as KL divergences between the 
inference output and the ground truth obtained using the forward-backward 
algorithm. In addition, bar plots of unit reward and sample count distributions 
among random choices in Adaptive LMH are shown for 1000, 10 000, and 100 000 
samples. 

As can be seen in the plots. Adaptive LMH (black) exhibits faster convergence 
over the whole range of evaluation, requiring half as many samples as LMH 
(cyan) to achieve the same approximation, with the median of LMH above the 
75% quantile of Adaptive LMH. 

In addition, the bar plots show unit rewards and sample counts for different 
random choices, providing an insight on the adaptive behavior of AdLMH. On 
the left-hand bar plots, red bars are normalized unit rewards, and blue bars are 
normalized sample counts. On the right-hand bar plots, the total height of a bar 
is the total sample count, with green section corresponding to the accepted, and 
yellow to the rejected samples. At 1 000 samples, the unit rewards have not yet 
converged, and exploration supersedes exploitation: random choices with lower 
acceptance rate are selected more often (choices 7, 8 and 13 corresponding to 
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Fig. 1. HMM, predicting the 0th and 17th state 


states 6, 7 and 12). At 10 000 samples, the unit rewards become close to their 
final values, and choices 1 and 18, immediately affecting the predicted states, 
are selected more often. At 100 000 samples, the unit rewards converge, and the 
sample counts correspond closely to the equilibrium state outlined in Lemma 

The second case study is estimation of hyperparameters of a Gaussian Pro¬ 
cess. We dehne a Gaussian Process of the form 

/ ^gv{m,k), 

where m{x) =ax + bx + c, k{x, x) = de 

The process has five hyperparameters, a, b, c, d, g. The program infers the poste¬ 
rior values of the hyperparameters by maximizing marginal likelihood of 6 ob¬ 
servations (0.0,0.5), (1.0,0.4), (2.0,0.2), (3.0,-0.05), (4.0,-0.2), and (5.0,0.!). 
Parameters a, 6, c of the mean function are predicted. Maximum of KS distances 
between inferred distributions of each of the predicted parameters and an ap¬ 
proximation of the target distributions is shown in Figure The approximation 
was obtained by running LMH with 2 000 000 samples per restart and 50 restarts, 
and then taking each 100th sample from the last 10 000 samples of each restart, 
5000 samples total. Just as for the previous case study, bar plots of unit rewards 
and sample counts are shown for 1000, 10 000, and 100 000 samples. 

Here as well. Adaptive LMH (black) converges faster over the whole range 
of evaluation, outperforming LMH by a factor 2 over the first 50 000 samples. 
Bar plots of unit rewards and sample counts for different number of choices, 
again, show the dynamics of sample allocation among random choices. Choices 
a, 5, and c are predicted, while choices d and g are required for inference but 
not predicted. Choice a has the lowest acceptance rate (ratio between the total 
height of the bar and the green part on the right-hand bar plot), but the unit 
reward is close the unit reward of choices b and c. At 1 000 samples, choice a is 
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Fig. 2. Gaussian process hyperparameter estimation 


selected with the highest probability. However, close to the converged state, at 
100 000 samples, choices a, b, and c are selected with similar probabilities. At 
the same time, choices 4 and 5 are selected with a lower probability. Both the 
exploration-exploitation dynamics for choices a-b and probability matching of 
selection probabilities among all choices secure improved convergence. 

The third case study involves a larger amount of data observed during each 
simulation of a probabilistic program. We use the well-known Iris dataset [9] to 
fit a model of classifying a given flower as of the species Iris setosa, as opposite 
to either Iris virginica or Iris versicolor. Each record in the dataset corresponds 
to an observation. For each observation, we define a feature vector x and an 
indicator variable Zi, which is 1 if and only if the observation is of an Iris setosa. 
We fit the model with five regression coefficients /3i,..., /Ss, defined as 

cr^ ~ InvGamma(l, 1), 

Pj ^ Normal(0, cr), 

=') = 

To assess the convergence, we perform shuffle split leave-2-out cross validation 
on the dataset, selecting one instance belonging to the species Iris setosa and 
one belonging to a different species for each run of the inference algorithm. The 
classification error is shown in Figure over 100 runs of LMH and Adaptive 
LMH. 

The results are consistent with other case studies: Adaptive LMH exhibits 
a faster convergence rate, requiring half as many samples to achieve the same 
classification accuracy as LMH. 
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Fig. 3. Logistic regression on Iris dataset. 


LMH 


AdLMH 


3 




0.05 0.10 0.15 0.20 0.25 

q 


0.05 0.10 0.15 0.20 0.25 

q 


Fig. 4. Kalman filter, 500 samples after 10 000 samples of burn-in. 


As a final case study we consider a linear dynamical system (i.e. a Kalman 
smoothing problem) that was previously described in [12] 

Xt ~ Norm(A • Xt-i,Q), ^ Norm(C • ait, R)- 

In this problem we assume that 16-dimensional observations are conditioned 
on 2-dimensional latent states Zt. We impose additional structure by assuming 
that the transition matrix A is a simple rotation with angular velocity w, whereas 
the transition covariance Q is a diagonal matrix with coefficient q, 

q O' 

0 (z ■ 

We predict posterior values for w, and g in a setting where C and R are as¬ 
sumed known, under mildly informative priors ui ~ Gamma(10, 2.5) and q ~ 
Gamma(10,100). Posterior inference is performed conditioned on a simulated 


A = 


cos UI — sm UI 
sin UI cos UI 


Q = 
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sequence of T = 100 observations, with w* = 47r/r, and q* = 0.1. The 
observation matrix C and covariance R are sampled row-wise from symmetric 
Dirichlet distributions with parameters c = 0.1, and r — 0.01 respectively. 

Figure shows a qualitative evaluation the mixing rate in the form of 500 
consecutive samples (w, q) from an LMH and AdLMH chain after 10 000 samples 
of burn-in. The LMH sequence exhibits good mixing over w but is strongly 
correlated in q, whereas the AdLMH sequence obtains a much better coverage 
of the space. 

To summarize, Adaptive LMH consistently attained faster convergence than 
LMH, measured by differences between the ongoing output distribution of the 
random program and the target independently obtained distribution, assessed 
using various metrics. Variable selection probabilities computed by Adaptive 
LMH are dynamically adapted during the inference, combining exploration of 
the model represented by the probabilistic program and exploitation of influence 
of random variables on program output. 


6 Contribution and Future Work 

In this paper we introduced a new algorithm, Adaptive LMH, for approximate 
inference in probabilistic programs. This algorithm adjusts sampling parame¬ 
ters based on the output of the probabilistic program in which the inference is 
performed. Contributions of the paper include 

— A scheme of rewarding random choice based on program output. 

— An approach to propagation of choice rewards to MH proposal scheduling 
parameters. 

— An application of this approach to LMH, where the probabilities of selecting 
each variable for modification are adjusted. 

Adaptive LMH was compared to LMH, its non-adaptive counterpart, and was 
found to consistently outperform LMH on several probabilistic programs, while 
still being almost as easy to implement. The time cost of additional computation 
due to adaptation was negligible. 

Although presented in the context of a particular sampling algorithm, the 
adaptation approach can be extended to other sampling methods. We believe 
that various sampling algorithms for probabilistic programming can benefit from 
output-sensitive adaptation. Additional potential for improvement lies in ac¬ 
quisition of dependencies between predicted expressions and random variables. 
Exploring alternative approaches for guiding exploration-exploitation compro¬ 
mise, in particular, based on Bayesian inference, is another promising research 
direction. 

Overall, output-sensitive approximate inference appears to bring clear advan¬ 
tages and should be further explored in the context of probabilistic programming 
models and algorithms. 
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